AITopics

2605.24741

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.87)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.62)

arXiv.org Machine LearningApr-7-2026

Robust Regression with Adaptive Contamination in Response: Optimal Rates and Computational Barriers

Diakonikolas, Ilias, Gao, Chao, Kane, Daniel M., Pensia, Ankit, Xie, Dong

We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber's contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, it turns out that the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m> 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence -- in the form of Statistical Query and Low-Degree Polynomial lower bounds -- that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.

artificial intelligence, machine learning, np0, (17 more...)

2604.04228

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(4 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Verchand, Kabir Aladin, Pensia, Ankit, Haque, Saminul, Kuditipudi, Rohith

High-dimensional estimation with missing data: Statistical and computational limits

arXiv.org Machine LearningMar-18-2026

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

artificial intelligence, data quality, machine learning, (18 more...)

2603.16712

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre:

Workflow (0.92)
Research Report > New Finding (0.45)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.46)
(2 more...)

Diakonikolas, Ilias, Kane, Daniel M., Pittas, Thanasis

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

arXiv.org Machine LearningMar-18-2026

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $ε$-contamination model. In this model an adversary can choose a function $r(x)$ between 0 and $ε$ and each sample $x$ goes missing with probability $r(x)$. Recent work Ma et al., 2024 proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) -- where missingness is independent of the data -- and Missing Not At Random (MNAR) -- where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.

algorithm, artificial intelligence, machine learning, (19 more...)

2603.16798

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Neural Information Processing SystemsMar-17-2026, 18:54:04 GMT

Optimal Shrinkage of Singular Values Under Random Data Contamination

A low rank matrix X has been contaminated by uniformly distributed noise, missing values, outliers and corrupt entries. Reconstruction of X from the singular values and singular vectors of the contaminated matrix Y is a key problem in machine learning, computer vision and data science. In this paper we show that common contamination models (including arbitrary combinations of uniform noise, missing values, outliers and corrupt entries) can be described efficiently using a single framework. We develop an asymptotically optimal algorithm that estimates X by manipulation of the singular values of Y, which applies to any of the contamination models considered. Finally, we find an explicit signal-to-noise cutoff, below which estimation of X from the singular value decomposition of Y must fail, in a well-defined sense.

artificial intelligence, machine learning, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Neural Information Processing SystemsFeb-7-2026, 16:15:45 GMT

1ae6464c6b5d51b363d7d96f97132c75-Paper.pdf

Robust learning is a critical field that seeks to develop efficient algorithms that can recover an underlying model despite possibly malicious corruptions in the data. In recent decades, being able to deal with corrupted measurements has become of crucial importance.

algorithm, artificial intelligence, machine learning, (19 more...)

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Neural Information Processing SystemsDec-24-2025, 12:48:13 GMT

On Learning Ising Models under Huber's Contamination Model

We study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. In such a setup, we aim to design statistically optimal estimators in a high-dimensional scaling in which the number of nodes p, the number of edges k and the maximal node degree d are allowed to increase to infinity as a function of the sample size n. Our analysis is based on exploiting moments of the underlying distribution, coupled with novel reductions to univariate estimation. Our proposed estimators achieve an optimal dimension independent dependence on the fraction of corrupted data in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample-complexity. We corroborate our theoretical results by simulations.

contamination model, learning ising model, name change, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Neural Information Processing SystemsNov-21-2025, 16:16:12 GMT

Optimal Shrinkage of Singular Values Under Random Data Contamination

optimal shrinkage, random data contamination, singular value, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Danny Barash, Matan Gavish

Optimal Shrinkage of Singular Values Under Random Data Contamination

Neural Information Processing SystemsNov-21-2025, 13:46:55 GMT

Reconstruction of low-rank matrices from noisy and otherwise contaminated data is a key problem in machine learning, computer vision and data science.

artificial intelligence, machine learning, singular value, (16 more...)

Country:

Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Neural Information Processing SystemsOct-2-2025, 04:10:45 GMT

Outlier Robust Mean Estimation with Subgaussian Rates via Stability

We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition.

artificial intelligence, estimator, machine learning, (16 more...)

Country: North America > United States (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Data Science (0.68)